Abstract

This paper describes deferred compilation, an alternative and complement to compile-time program
analysis  and  optimization.  By  deferring  aspects  of  compilation  to  run  time,  exact  information 
about  programs  can  be  exploited,  leading  to  greater  opportunities  for  code  improvement.  This  is 
in  contrast  to  the  use  of  static  analyses,  which  are  inherently  conservative. 

Deferred  compilation  automates  the  translation  of  ordinary  programs  into  native  machine  code 
that  performs  fast  optimization  and  native-code  generation  at  run  time.  Automation  is  obtained 
through the use of a compile-time staging analysis, which determines the portions of a program
that  may  be  safely  and  profitably  compiled  at  run  time.  Fast  run-time  optimization  is  obtained 
by  trading  space  for  time:  compile-time  specialization  yields  numerous  run-time  code  generators, 
each  customized  to  optimize  a  small  portion  of  the  source  program  based  on  run-time  information. 
Implementation  strategies  developed  for  a  prototype  compiler  are  discussed,  and  the  results  of 
preliminary  experiments  demonstrating  significant  overall  speedup  are  presented. 
1  Introduction

Many compiler optimization techniques depend on static analysis to determine information about
a program's run-time behavior. As a result, a great deal of research has been invested in the
development of better approaches to static program analysis, particularly in the areas of dataflow
analysis, abstract interpretation, and non-standard forms of type inference. Despite good progress,
such analyses tend to be extremely conservative in practice, thus making it difficult for a compiler
to achieve thorough optimization of programs. This is, of course, a fundamental problem since
most aspects of program run-time behavior are undecidable. Also, as a practical matter, further
compromises in precision must be made in order to cope with the complexity and inefficiency of
many analysis algorithms.
An alternative approach is to defer at least some of the analysis and optimization (and therefore
also code generation) to run time. While this does not avoid the fundamental problems of
undecidability and inefficiency, it does make possible the use of run-time values in improving code
quality. This is an old idea that has been applied in many different ways. For example, for regular
expression search, Thompson describes what essentially amounts to a compiler for regular expressions.
A program can invoke this compiler at run time to obtain machine code optimized for a specific
regular expression [Tho68]. A similar approach has also been applied to bitblt [PLR85] and to the
implementation of operating system services [Mas92, PMI88]. For general programming, Keppel,
Eggers, and Henry have studied several manual methods for obtaining such “application-specific”
compilers, and they show that good results are possible for realistic C programs [KEH93].

There are other ways to improve program performance using run-time information. For example,
the compiler can arrange for programs to collect run-time data during development and testing,
and then use the collected profile information in optimizing the code for final delivery [Wal91].
Koopman et al. obtained improvements in the performance of a lazy functional language by
implementing graph reduction as self-modifying code [KLS92]. And, of course, there have been
countless other applications of self-modifying code.

In this paper, we report on our experience with a new approach to generating optimized code at
run time. We have implemented a prototype compiler, which we call Fabius, that can automatically
compile a general program into RISC machine code that in turn generates optimized native
code at run time. There are several notable examples of compilers for object-oriented languages
that perform aspects of compilation at run time, including the Smalltalk-80 system by Deutsch
and Schiffman [DS84] and the SELF compiler by Chambers and Ungar [CU89]. The approach
we have taken differs in a number of crucial ways. Perhaps most fundamentally, we compile a
functional programming language and hence are able to take advantage of previously developed
techniques for compiling and transforming functional programs, including aspects of offline partial
evaluation [JSS89, JGS93]. This also facilitates the development of an automatic staging analysis
that allows code to be dynamically generated for any part of a program, rather than being restricted
to particular points such as method (procedure) invocations.

Other  salient  characteristics  of  our  approach,  which  we  term  deferred  compilation,  are  as  follows: 

•  It  is  automatic.  No  programming  or  programmer  intervention  is  required.  An  automatic  stag¬ 
ing  analysis  determines  those  parts  of  the  program  to  be  subjected  to  run-time  compilation, 
with  or  without  the  guidance  of  the  programmer. 

•  It is general. Dynamic code generation is not limited to particular constructs or code
templates. Furthermore, many standard compilation techniques, such as register allocation and
inlining, can be employed.

•  It is fast. No general compilation or processing of the source program occurs at run time.
Instead, each part of the compiled code that performs run-time code generation is specialized
to optimize and generate code for a small portion of the input program.

In preliminary experiments, we have found that the overhead of deferred compilation is often
quite small when compared to the performance gain. Furthermore, we have encountered unique
design tradeoffs in considering which aspects of optimization and code generation should be
performed statically and which should be deferred to run time. We see some encouraging signs that
deferred compilation can be practical, and find that there is much further work to be done.

To introduce deferred compilation, we begin with a simple example that illustrates the basic
points. Then in Section 3 we give an overview of some strategies and techniques for deferred
compilation. Our desire to keep the cost of run-time code generation as low as possible leads
to several important practical considerations. In Section 4 we describe some of the details of a
prototype implementation and present the results of preliminary experiments with the system.
This is followed by sections on the secondary costs of run-time code generation and the connections
between deferred compilation and partial evaluation.

2  An  Example 

A  simple  example  illustrates  some  of  the  techniques  employed  by  deferred  compilation.  Consider 
a  program  that  contains  a  (tail-recursive)  definition  of  the  exponentiation  function: 

power (exp, base, accum) =
    if exp = 0 then accum
    else power (exp - 1, base, accum * base);


... power (e, b, 1) ...

A conventional compilation of power might yield the following machine code:¹


power:  beq   r1, r0, L1      ; if exp = 0 goto L1
        sub   r1, r1, 1       ; exp = exp - 1
        mul   r3, r3, r2      ; accum = accum * base
        jmp   power           ; goto power
L1:     move  r1, r3          ; result = accum
        ret                   ; return


Suppose the program calls power repeatedly, but with the first argument changing more slowly
than the second argument. This would arise, for example, in a loop where each iteration computes
a new base and calls power without varying the exponent. One can also imagine a curried version
of power which is applied to an exponent value and then passed to a mapping function. In such
situations,  we  say  that  the  first  argument  is  computed  in  an  early  stage  and  the  second  argument 
is  computed  in  a  late  stage. 
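The curried usage just described can be made concrete with a small Python sketch. This is an illustration only, not the source language of the prototype; the names curried_power and cubes are assumptions introduced here. The exponent is supplied once, in the early stage, and the resulting function is then mapped over many bases in the late stage.

```python
from typing import Callable

def power(exp: int, base: int, accum: int = 1) -> int:
    """Iterative form of the tail-recursive definition in the text."""
    while exp != 0:
        exp, accum = exp - 1, accum * base
    return accum

def curried_power(exp: int) -> Callable[[int], int]:
    """Early stage: fix the exponent. Late stage: supply the base."""
    return lambda base: power(exp, base, 1)

cube = curried_power(3)              # early argument supplied once
cubes = list(map(cube, [1, 2, 3]))   # late argument varies per call
```

Here the closure only records the exponent; nothing is optimized. Deferred compilation aims to do better by generating code specialized to the exponent at the point where the closure would be built.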

A staging analysis can be used to identify such computation stages and label those subexpressions
in the program that depend only on the early arguments, as opposed to those that require
late argument values. In the case of just two stages, this labeling of early and late computations
corresponds precisely to a binding-time analysis and annotation [JSS89, JGS93], and in fact our
prototype compiler incorporates a binding-time analyzer.

¹ For simplicity of presentation, we assume an idealized RISC architecture with no delay slots; see Appendix A.

Early computations are compiled in the usual way, but late computations are translated into
code that emits the corresponding instructions at run time. In this example, since the exponent is
available at an early stage, the conditional test and subtraction expressions are compiled normally,
but the compilation of the multiplication expression is deferred to run time. In the simplest form
of deferred compilation, we might obtain the following code:²


powgen: beq   r1, r0, L1
        sub   r1, r1, 1
        emit  mul r3, r3, r2
        jmp   powgen
L1:     emit  move r1, r3
        ret

Note that the only difference between power and powgen is that the multiplication instruction
is emitted (perhaps many times) instead of being executed, as is the instruction that moves the
accumulator to the result register. When called with exp = 5, powgen completely unrolls the loop
and generates code with all “constants” folded and all “dead code” eliminated:


        mul   r3, r3, r2
        mul   r3, r3, r2
        mul   r3, r3, r2
        mul   r3, r3, r2
        mul   r3, r3, r2
        move  r1, r3
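The behavior of powgen can be mimicked in a high-level language. The following Python sketch is an illustration under stated assumptions, not the prototype's implementation: instructions are represented as tuples, the emit pseudo-instruction becomes a list append, and a toy interpreter (run, a name introduced here) stands in for the hardware that executes the generated code.

```python
def powgen(exp):
    """Early stage: run the loop on exp, emitting the late instructions."""
    code = []
    while exp != 0:                              # beq / sub / jmp execute now
        exp -= 1
        code.append(("mul", "r3", "r3", "r2"))   # emit mul r3, r3, r2
    code.append(("move", "r1", "r3"))            # emit move r1, r3
    return code

def run(code, regs):
    """Late stage: a toy interpreter for the emitted instructions."""
    for op, dst, *srcs in code:
        if op == "mul":
            regs[dst] = regs[srcs[0]] * regs[srcs[1]]
        elif op == "move":
            regs[dst] = regs[srcs[0]]
    return regs["r1"]

code = powgen(5)
result = run(code, {"r1": 0, "r2": 2, "r3": 1})  # base = 2, accum = 1
```

With exp = 5 the generated list contains five multiplications followed by one move, mirroring the listing above; no test or decrement of the exponent survives into the late stage.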

Deferred compilation can be fast enough to pay off quickly. On a typical RISC architecture a
fixed-argument instruction can be emitted in as few as four cycles (see Appendix A). Under this
assumption the costs incurred by powgen are recovered after only three iterations of the run-time
generated code when exp = 5.
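A back-of-the-envelope calculation suggests where a figure of this size can come from. The sketch below is an assumption-laden estimate, not the cost model of Appendix A: it charges four cycles per emit (as stated in the text) and, as a simplifying assumption, one cycle for every other instruction. The five multiplications are executed by both the conventional and the generated code, so their cost cancels and is left out.

```python
import math

EMIT_CYCLES = 4     # from the text: a fixed-argument emit takes four cycles
OTHER_CYCLES = 1    # assumption: each remaining non-mul instruction takes one cycle
EXP = 5

# Conventional power: beq, sub, jmp per iteration (mul excluded),
# then beq, move, ret on exit.
conventional = EXP * 3 * OTHER_CYCLES + 3 * OTHER_CYCLES

# Generated code: besides the multiplications, only the final move remains.
generated = 1 * OTHER_CYCLES

saving_per_call = conventional - generated

# powgen: beq, sub, jmp plus one emit per iteration,
# then beq, emit, ret on exit.
generation_cost = EXP * (3 * OTHER_CYCLES + EMIT_CYCLES) + (2 * OTHER_CYCLES + EMIT_CYCLES)

break_even_calls = math.ceil(generation_cost / saving_per_call)
```

Under these assumptions the generator costs 41 cycles and each use of the generated code saves 17, so the cost is recovered on the third use. A different cycle model would shift the exact numbers.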

Making deferred compilation practical for a wide variety of programs is more of a challenge
than this simple example might imply. Here we see that run-time loop unrolling can be highly
profitable, but clearly there are limits; if pursued too aggressively, the run-time overhead may
exceed the performance gain of the dynamically generated code. Another complication stems from
the fact that real-world programs often contain many more than the two stages of computation
exhibited by this example, a large number of which may benefit from run-time optimization. Thus,
a conventional binding-time analysis is not, in general, powerful enough for our needs.

The next section discusses these issues in more detail and proposes several strategies for
addressing them. We also examine how a wider range of optimizations and code generation techniques
can be adapted to deferred compilation. The effectiveness of some of these techniques is examined
in the context of a more realistic example in Section 4.


² We use the pseudo-instruction emit to simplify the presentation. It expands into a sequence of instructions that
allocates space in a dynamic code segment, builds the representation of an instruction from its opcode and arguments,
and finally writes the instruction to the allocated space. In this example the arguments in the emitted instruction
are fixed; the first emit creates “mul r3, r3, r2”, regardless of the current values of r2 and r3.
3  Strategies for Deferred Compilation


The example above, though unrealistically simple, illustrates the basic elements of deferred
compilation. First, a staging analysis is used to determine the stage at which each subexpression of
a program is computed. In essence, this identifies data and control dependencies in the program
and reveals in which stages run-time optimization may be of benefit. Second, no general-purpose
compilation occurs at run time. Instead, parts of the source program are compiled into specialized
code generators (e.g., powgen), each customized to optimizing and generating code based
on run-time values.

3.1  Stages of Computation

Stages of computation occur naturally in both functional and imperative programs. For
example, when a strict curried function f of type α → β → γ is applied to an argument x, a closure
representing a value of type β → γ will typically be constructed before computations involving
additional arguments proceed. It may be profitable to generate optimized code for f(x) if it will
be applied many times. Deferred compilation can therefore be viewed as an alternative to the
conventional implementation of closures.³

A similar phenomenon occurs in programs with nested loops. The outer loop index is always
computed before executing inner loops, and substantial benefits might be obtained by specializing
inner loops to its value at each iteration. More deeply nested loops lead to more stages of
computation.
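The nested-loop case can be sketched in Python as follows. This is a hypothetical illustration, not taken from the prototype: make_inner and scale_rows are names introduced here, and closures stand in for run-time generated code. The outer index i is the early value; the inner loop over a row is the late computation, and it is specialized once per outer iteration.

```python
def make_inner(i, n):
    """Early stage: specialize the inner loop body to the outer index i."""
    if i == 0:
        return lambda row: [0] * n           # i * x is always 0
    if i == 1:
        return lambda row: list(row)         # i * x is just x
    return lambda row: [i * x for x in row]  # general case

def scale_rows(matrix):
    """Multiply every element of row i by i, one specialized loop per row."""
    n = len(matrix[0])
    out = []
    for i, row in enumerate(matrix):         # outer index: early stage
        inner = make_inner(i, n)             # specialize once per outer iteration
        out.append(inner(row))               # inner loop: late stage
    return out

scaled = scale_rows([[5, 6], [7, 8], [9, 10]])
```

Whether such specialization pays off depends on how much work the inner loop does relative to the cost of specializing it, which is the tradeoff discussed throughout this section.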

Computation stages arise naturally from other programming language constructs. Macros in
Scheme and other languages are early computation stages that have been manually identified by
the programmer; macro expansion performs these computations before compiler optimizations are
applied. In Standard ML the phase distinction property of the modules sublanguage guarantees
that arguments to functors are available at an earlier stage than arguments to functions [HMM90].
Hence, deferred compilation can be used to compile functors into code that will generate optimized
function code at functor application time. Functor application is similar to the linking of object
code in conventional languages, the speed of which is not a high priority, so deferring highly
aggressive optimizations to this stage appears practical [HBHM93]. Link-time optimization has
also been studied by Srivastava and Wall [SW93].

In practice, programmers often arrange for computations to be staged so that the costs of
early computations can be amortized over many late computations [CPW93]. For example, in
a Standard ML implementation of a network communications system, Biagioni et al. [BHL93]
describe the structure of a send procedure with the type

send : connection -> message -> unit

The computation is staged so that send analyzes the connection and then selects one of several
possible message sending procedures (of type message -> unit). Since many messages are usually
sent on a connection, this allows the cost of connection-specific processing to be amortized over all
of the message sends. Deferred compilation can exploit such staging even if it is not explicit in the
program text.
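The staging of send can be pictured with the following Python sketch. It is a hypothetical rendering, not the code of [BHL93]: the connection is modeled as a dictionary with assumed fields "reliable" and "peer", and the selected procedures return a tuple (rather than unit) so that the effect of the selection is visible.

```python
def send(connection):
    """Early stage: analyze the connection once and select a sender."""
    peer = connection["peer"]                # connection-specific work, done once
    if connection["reliable"]:
        def send_message(message):
            return ("reliable", peer, message)
    else:
        def send_message(message):
            return ("datagram", peer, message)
    return send_message                      # late stage: one call per message

send_to_host = send({"reliable": True, "peer": "host-a"})
first = send_to_host("hello")
second = send_to_host("world")               # no connection analysis repeated
```

The programmer has staged the computation by hand here. The point of the text is that deferred compilation could obtain the same amortization from an unstaged definition taking both arguments at once.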


³ Appel has made a similar observation [App87], and “all code” closures have been proposed by Feeley and
Lapalme [FL92].
We have restricted our attention to two computation stages in this paper in order to simplify the
presentation. In general, however, programs exhibit many more stages, and deferred compilation
can in principle exploit an arbitrary number. Consider the case of a function of three arguments,
f (x, y, z), in which the argument x changes more slowly than y, which in turn changes more slowly
than z. In this case it may be profitable to identify three computation stages (call them “early,”
“middle,” and “late”) and generate code for an fgen function that, given the first argument,
generates the code of another specialized code generator.
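A three-stage generator of this kind can be sketched with nested closures in Python. This is only an analogy: closures stand in for generated machine code, and the body of f (a sum of squares) is an arbitrary example chosen here, not taken from the paper.

```python
def fgen(x):
    """Early stage: given x, produce a generator specialized to x."""
    xx = x * x                       # work that depends only on x
    def fgen_x(y):
        """Middle stage: given y, produce code specialized to x and y."""
        t = xx + y * y               # work that depends only on x and y
        def f_xy(z):
            """Late stage: the residual computation on z."""
            return t + z * z
        return f_xy
    return fgen_x

gen_for_x = fgen(1)                  # runs once per change of x
code_for_xy = gen_for_x(2)           # runs once per change of y
value = code_for_xy(3)               # runs for every z
```

Each level performs as much work as its inputs allow and passes only the residual computation on to the next stage.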


3.2  Staging  Analysis 

Programs can have many stages of computation, and so a key problem is how to identify those
for which deferred compilation will be profitable. This is similar to the problem of deciding where
to inline procedures in conventional compilation [CHT91] and the automatic determination of
specialization points during partial evaluation [JGS93, BD91]. But as we have seen, syntactic
features of programming languages often provide clear indications of stages that can be usefully
subjected to deferred compilation. In some cases the use of programmer-supplied hints, such as
the use of curried function syntax, would also be useful.

Once useful program stages are identified, each subexpression of the program can be analyzed
to determine (approximately) to what stage it belongs. This is essentially a dependency analysis: a
subexpression that only depends upon values computed at or before stage n computes a value that
also belongs to stage n. Although this is conceptually simple, approximations must be made so that
the stages of computations involving recursion can be finitely computed. Hence, this propagation
of staging information is best accomplished via a dataflow analysis or abstract interpretation.
Of course, since the analysis is necessarily approximate, early stages might be assigned to some
expressions that are actually late, and vice versa. Optimization opportunities are lost in the
former case, and unnecessary run-time code generation occurs in the latter case. Hence, refining
the precision of staging analysis is of fundamental importance.
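The dependency rule at the heart of the analysis can be sketched in a few lines of Python. This is a deliberately minimal illustration, not the analyzer used in the prototype: it handles only constants, variables, and binary operators, and it omits the fixpoint computation that recursion would require. The tuple encoding of expressions and the name stage_of are assumptions made for the example.

```python
def stage_of(expr, env):
    """Return the stage of expr: the latest stage among its dependencies.

    expr is an int constant, a variable name, or a tuple (op, left, right).
    env maps variable names to stage numbers (0 is the earliest stage).
    """
    if isinstance(expr, int):
        return 0                             # constants are available at once
    if isinstance(expr, str):
        return env[expr]                     # a variable has its declared stage
    _, left, right = expr
    return max(stage_of(left, env), stage_of(right, env))

# The power example: exp is early (stage 0); base and accum are late (stage 1).
env = {"exp": 0, "base": 1, "accum": 1}
test_stage = stage_of(("=", "exp", 0), env)
sub_stage = stage_of(("-", "exp", 1), env)
mul_stage = stage_of(("*", "accum", "base"), env)
```

On the power example this labels the test and the subtraction as early and the multiplication as late, which is the annotation that powgen obeys.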

Further technical details of the staging analysis problem are beyond the scope of this paper,
but we refer the reader to the literature on conventional binding-time analysis [JSS89, Con93],
which is precisely a staging analysis for the special case of two stages. We have, for the time
being, restricted our attention to two stages, and we use a conventional binding-time analyzer in
our prototype compiler with good results (see Section 4). Note, however, that in programs where
there are more than two useful stages, a binding-time analysis forces distinct stages to be merged,
thus causing opportunities for run-time code generation to be lost. To gain maximum benefit
from deferred compilation, a generalization of binding-time analysis to an arbitrary number of
computation stages is required.

3.3  Limitations of Static Specialization

The examples mentioned earlier showed some of the circumstances in which computation stages
can be exploited by a compiler. We have yet to explain, however, why run-time compilation is
needed. To see this, consider the alternative of using a source-to-source transformation instead
of deferred compilation. For the power example, an effect similar to deferred compilation can be
obtained by transforming power into the following code:⁴


nth (exp,
     ...
     lambda (base) base * base,
     lambda (base) base * base * base,
     ...

⁴ nth(i, (x_0, ..., x_n)) yields the ith element of a tuple.
This definition of powgen can be obtained by creating a table of specialized versions of power,
each of which is created by choosing a value for exp from the set {0, 1, ..., k} and then applying a
partial evaluator [Bon93] to power and exp. Similar transformations might also be obtained by
applying staging transformation [JS86], program bifurcation [DBV91], or procedure cloning [CHK93].
In either case, highly optimized definitions of the specialized functions can be obtained, which can
then be compiled into high-quality machine code. Hence, one might expect this approach to be
useful in the same situations as deferred compilation.

However, there are two practical problems in performing such a transformation automatically.
First of all, there is the matter of choosing the set of values on which to specialize. In powgen, for
example, there is no guarantee that the set {0, 1, ..., k} is a good one, since the range of exponents
that will be supplied at run time usually cannot be predicted. In fact, the specialization would not
in general be on simple integer values, but possibly on arbitrary data structures.

A  second  problem  is  that  all  of  the  specialized  functions  must  appear  in  the  transformed  source 
program. This incurs a serious cost in space usage, and is wasteful since only a few of the functions
might  be  used  in  a  single  program  execution.  In  practice,  a  relatively  small  limit  must  be  placed 
on  the  number  of  specialized  functions  created  at  compile  time  (represented  by  the  constant  k  in 
the  above  example). 
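Both problems are visible in a direct Python rendering of the table-based transformation. The sketch is illustrative only; the bound K = 3 and the fallback to the general power function for exponents outside the table are assumptions added here to keep the example total.

```python
K = 3   # compile-time limit on the number of specialized versions

# Every specialized version must be written out in the program text.
TABLE = (
    lambda base: 1,
    lambda base: base,
    lambda base: base * base,
    lambda base: base * base * base,
)

def power(exp, base, accum=1):
    """The general, unspecialized definition."""
    while exp != 0:
        exp, accum = exp - 1, accum * base
    return accum

def powgen(exp):
    """Select a precompiled specialization, as nth(exp, ...) does."""
    if 0 <= exp <= K:
        return TABLE[exp]
    # Assumed fallback: no specialized version was created for this value.
    return lambda base: power(exp, base)

square = powgen(2)
sixth = powgen(6)    # outside the table: no benefit from specialization
```

All K + 1 entries occupy space whether or not they are ever selected, and any exponent outside the chosen range receives no specialization at all.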

Hence, a key aspect of deferred compilation is to arrange for specialization to occur “on demand”
(or “just in time”). Furthermore, our desire to minimize the cost of run-time code generation leads
us to specialize the compilation process itself. In other words, we wish to avoid the overhead
of manipulating source programs, which one finds in a general compiler, and instead create code
generators that are specialized to optimizing a fixed piece of code based on run-time values.

One can consider incorporating conventional compilation techniques into specialized run-time
code generators. In fact, one of the key design issues in deferred compilation is deciding how to
apportion  the  costs  of  optimization  and  code  generation  between  compile  time  and  run  time.  In 
the  next  section  we  consider  the  particular  case  of  register  allocation. 


3.4  Register  Allocation  for  Deferred  Compilation 

Conventional compilers often use graph-coloring algorithms to assign variables to a limited number
of registers [Cha82, CH84]. An interference graph is constructed, with nodes representing the
lifetime ranges of variables and edges indicating where these ranges intersect. Any K-coloring of
the interference graph is therefore a valid assignment of the variables to K registers. This section
describes how such techniques can be applied when compilation is deferred.


3.4.1  Compile-Time  Register  Allocation 

We first consider a strategy for performing all register allocation at compile time. The significant
complication is that different stages in a program can use the same set of registers because their
execution is not interleaved. For example, the powgen function presented in Section 2 can exploit
the fact that computations involving the exponent and base belong to different program stages by
assigning those variables to the same register:


powgen: beq   r1, r0, L1
        sub   r1, r1, 1
        emit  mul r2, r2, r1
        jmp   powgen
L1:     emit  move r1, r2
        ret

The usual notion of lifetime ranges does not capture this distinction, since the staging being
exploited is not explicit in the source program. For example, computations involving exp and base
are textually adjacent but belong to different computation stages. Conventional register allocation
algorithms may nonetheless be used for deferred compilation by simply modifying the construction
of the interference graph. A standard lifetime analysis can be conducted without regard to the
staging of the program, followed by an analysis that determines the program stage to which each
variable belongs. During construction of the interference graph, edges are only added between
overlapping lifetime ranges of variables from the same program stage.
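The modified graph construction can be sketched in Python as follows. This is a simplified illustration under several assumptions: lifetime ranges are single closed intervals of program points, the interval values are invented for the example, and a simple greedy coloring stands in for the graph-coloring allocators of [Cha82, CH84].

```python
from itertools import combinations

def interference_graph(live_ranges, stage):
    """Add an edge only when two ranges overlap AND the stages agree."""
    edges = set()
    for a, b in combinations(sorted(live_ranges), 2):
        (a0, a1), (b0, b1) = live_ranges[a], live_ranges[b]
        overlap = a0 <= b1 and b0 <= a1
        if overlap and stage[a] == stage[b]:
            edges.add((a, b))
    return edges

def greedy_color(variables, edges):
    """Give each variable the lowest register not used by a neighbor."""
    color = {}
    for v in variables:
        taken = {color[u] for u in color if (u, v) in edges or (v, u) in edges}
        color[v] = next(r for r in range(len(variables)) if r not in taken)
    return color

# All three ranges overlap textually, but exp belongs to a different stage.
live = {"exp": (0, 3), "base": (0, 3), "accum": (0, 4)}
stage = {"exp": "early", "base": "late", "accum": "late"}
edges = interference_graph(live, stage)
regs = greedy_color(["exp", "base", "accum"], edges)
```

Because exp and base never interfere under the staged rule, the coloring places them in the same register, which is the sharing exploited by the powgen listing above.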

3.4.2  Run-Time Register Allocation

Although compile-time register allocation leads to fast run-time code generation, it suffers several
limitations. Inlining and loop unrolling may occur at run time, so an exact interference graph cannot
be constructed at compile time. Also, fixing the register assignment of a function at compile time
makes it difficult to inline in some contexts. For example, registers must be shuffled if the formal
and actual parameters are assigned to different registers, and so forth. If the number of contexts
in which a function will be inlined is small, compile-time code duplication combined with fixed
register assignments can be effective, but in general the space required will be prohibitive.

It is therefore desirable to perform run-time register allocation in some cases. Although register
allocation can be performed on a run-time intermediate representation of code, the cost of
processing such a representation is likely to pay off only when the generated code is executed many
times. A more efficient strategy is to perform register allocation at compile time but defer register
assignment until run time. A static approximation of the interference graph can be constructed as
described in the previous section, and the run-time code generators can be parameterized by some
representation of the desired register mapping. For example, powgen can perform run-time register
assignment as follows:


powgen: beq   r1, r0, L1
        sub   r1, r1, 1
        emit  mul r[r3], r[r3], r[r2]
        jmp   powgen
L1:     emit  move r[r4], r[r3]
        ret

This function takes four arguments: the value of exp (in r1), the numbers of the registers
assigned to base and accum (in r2 and r3), and the number of the destination register (in r4).
The emit pseudo-instruction used here determines the operands of the emitted instruction from
the contents of the specified registers. This takes more time than emitting instructions with fixed
operands, but the generated code will be more efficient in contexts that would otherwise require
the register shuffling described above.

The representation of register mappings has a significant impact on the cost of run-time register
assignment. In the above example, register mappings are maintained in registers throughout early
stages of computation; instructions can be emitted quickly because no memory access is required
to determine their arguments. It remains to be seen whether this savings will in general justify the
increased register pressure suffered by early computations.
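A generator parameterized by a register mapping can be sketched in Python as shown below. As before, this is an illustration rather than the prototype's code: emitted instructions are tuples, register numbers are plain integers, and the two example contexts are invented.

```python
def powgen(exp, base_reg, accum_reg, dest_reg):
    """Emit code whose operands are taken from the register mapping."""
    code = []
    while exp != 0:
        exp -= 1
        # emit mul r[accum], r[accum], r[base]
        code.append(("mul", accum_reg, accum_reg, base_reg))
    # emit move r[dest], r[accum]
    code.append(("move", dest_reg, accum_reg))
    return code

# The same generator serves two inlining contexts with different mappings.
ctx_a = powgen(2, base_reg=2, accum_reg=3, dest_reg=1)
ctx_b = powgen(2, base_reg=7, accum_reg=5, dest_reg=5)
```

Neither context needs register shuffling around the inlined code, at the price of passing three extra arguments to the generator and computing operands at emit time.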

3.5  Specialized  Run-Time  Code  Generators 

Performing  most  of  the  work  of  register  allocation  at  compile  time  can  greatly  improve  the  speed  of 
run-time  code  generation.  Many  other  conventional  optimization  and  code  generation  techniques 
can  be  similarly  adapted  to  deferred  compilation.  This  section  gives  a  brief  overview  of  our  work 
in  this  area. 

We have generalized destination-driven code generation [DHB90] to produce specialized run-time
code generators (henceforth simply called generators) that do not manipulate any representation
of the source program at run time. The algorithm is surprisingly straightforward because
it obeys staging annotations rather blindly. As an expression is traversed, “early” operations are
converted to machine code that performs the appropriate computation, while “late” operations are
compiled into code that emits the machine instructions that will eventually perform the
computation.

As the example in Section 2 demonstrates, this simple technique produces highly effective
run-time optimizations. These optimizations are more powerful than those found in many template
compilers [KEH91], and eliminating the need for run-time processing of an intermediate
representation or template can yield much faster code generation. Many conventional peephole
optimizations, such as strength reduction and instruction selection, can easily be incorporated. For
example, a generator can avoid emitting a multiplication involving a value x from an earlier stage if
it takes the time to determine whether x = 1. The increased cost of such run-time optimizations must
be weighed against their benefit; a staging analysis that determines where to aggressively optimize
would facilitate such decisions.
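Such a check might look as follows in a Python sketch of a generator for dst = src * x, where x is an early value. The opcode names shl and muli (multiply by an immediate) are hypothetical, and the power-of-two case is an additional strength reduction included here for illustration.

```python
def gen_scale(x, dst, src):
    """Generator for 'dst = src * x', where x is known at an early stage."""
    if x == 1:
        # Multiplication by one: a move suffices, or nothing if dst == src.
        return [] if dst == src else [("move", dst, src)]
    if x > 0 and (x & (x - 1)) == 0:
        # Power of two: strength-reduce the multiplication to a shift.
        return [("shl", dst, src, x.bit_length() - 1)]
    return [("muli", dst, src, x)]

no_code = gen_scale(1, "r3", "r3")
shift = gen_scale(8, "r3", "r2")
general = gen_scale(6, "r3", "r2")
```

Each test costs the generator a few instructions on every invocation, which is the run-time cost that must be weighed against the improvement in the emitted code.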

A generator that emits native machine code in a single pass will be faster than one that builds
an intermediate representation, performs analysis and optimization, and then generates machine
instructions. However, it can be difficult to produce good quality native code in a single pass.
Branches and procedure calls are problematic because the destination may be code that has not
yet been compiled. Due to run-time inlining and loop unrolling the generator may not be able to
predict where the target code will eventually be located, so run-time backpatching is necessary.
Span-dependent instructions are challenging for similar reasons. Good instruction scheduling is
also difficult to achieve in a single pass. Although a schedule can be “hard-wired” into generators
for straight-line blocks, scheduling across basic-block boundaries requires more general techniques.
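Backpatching in a single-pass generator can be sketched in Python as follows. The sketch makes several assumptions: the code buffer is a list, branch targets are list indices, and the instruction encoding (mutable lists with a placeholder of None) is invented for the example.

```python
def emit_if(code, then_body, else_body):
    """Emit 'if r1 == 0 then ... else ...' in one pass with backpatching."""
    branch_at = len(code)
    code.append(["beq", "r1", "r0", None])   # forward target not yet known
    code.extend(else_body)
    jump_at = len(code)
    code.append(["jmp", None])               # forward target not yet known
    code[branch_at][3] = len(code)           # patch: then-part starts here
    code.extend(then_body)
    code[jump_at][1] = len(code)             # patch: join point after then-part
    return code

code = emit_if(
    [],
    then_body=[["move", "r1", "r3"]],
    else_body=[["sub", "r1", "r1", 1], ["mul", "r3", "r3", "r2"]],
)
```

The branch is written before the length of the else-part is known and is completed afterwards. When the bodies are themselves produced by generators that unroll or inline, their lengths are only known at run time, so the patching must also happen at run time.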

We are also investigating the adaptation of inlining and loop unrolling algorithms to deferred
compilation. In conventional compilers such techniques yield increased opportunities for
optimization and improve the amortization of various computations, such as range checks. Our
preliminary work suggests that similar benefits can be obtained by run-time inlining and loop
unrolling. It can be difficult to statically determine where to inline or how far to unroll a loop. The
use of run-time information to guide such decisions may prove to be of significant benefit. We have
augmented the compile-time code generator described above with the partial evaluation technique of
unfolding [BD91, JGS93], a form of inlining.
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In  v'rr.o  contexts  it  is  impractical  to  inline  a  function  yet  still  desirable  to  opt :  b.ise<; 
upon  the  results  of  earlier  computations.  For  example  if  a  function  is  caile  1  at  msm  d .ffrrrn 
urogram  points  with  the  same  value  from  an  early  computation,  it  may  be  preferable  to  ^rneruie  ,» 
single  optimized  version  of  the  function  rather  than  inlining  its  bodv  at  each  call  «i  te  f  ora:  re. 
is  commonly  called  specialization  JSSS9'.  Specialization  also  permits  run  time  opt.  rtf  zed  r 
be  reused  rather  than  regenerated,  which  saves  both  space  and  time.  The  mrmmzaiior.  of  run 
time  code  generators  is  a  simple  wav  to  achieve  such  reuse.  Run  time  memoization  caii  be  quite 
expensive,  particularly  when  memoizing  on  structured  data  Ma!93  .  Nevertheless,  prehmmarv 
experiments  indicate  that  it  is  worthwhile  in  some  applications.  The  development  r  static  ami 
dynamic  strategies  for  controlling  memoization  is  an  interesting  open  problem. 
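
A memoized code generator can be sketched in a few lines. The following Python fragment is only an illustration: `memoize_generator` and the dictionary cache are assumptions for the example, and a Python closure stands in for generated native code.

```python
def memoize_generator(generate):
    """Wrap a run-time code generator so that code is produced once per
    distinct early value and reused afterwards (hypothetical helper)."""
    cache = {}

    def generator(early):
        # Structured early data must be traversed to build a lookup key;
        # this traversal is one source of the memoization cost noted above.
        key = tuple(early) if isinstance(early, list) else early
        if key not in cache:
            cache[key] = generate(early)   # generate only on a cache miss
        return cache[key]

    generator.cache = cache
    return generator
```

A policy for evicting entries from `cache` is deliberately left out; choosing one is an instance of the open problem of controlling memoization.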


4  Implementation 

We have implemented a prototype compiler called Fabius⁵ that incorporates many of the deferred
compilation strategies described in the previous section, as described below. The primary goal
of Fabius is to reduce the run-time cost of code generation to a minimum, at the cost of some
degradation in the quality of the generated code and an increase in the size of both the generating
and the generated code. This provides a baseline for the evaluation of compilers that perform more
aggressive run-time optimizations.

The Fabius source language is a rudimentary, strict, first-order functional language. Integers
and pointers to heap-allocated structures are the only run-time values; Fabius does not support
arrays or assignment. We have currently limited our attention to two-stage programs, so that the
problem of staging analysis becomes one of binding-time analysis [NN92]. The staging analysis also
determines how function calls should be treated by the code generator. An aggressive heuristic is
used to determine which function applications should be inlined: function calls in the branches of
late conditionals are specialized, but all other calls are inlined [BD91]. All analysis is automatic,
requiring no programmer intervention.
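
The core idea of binding-time analysis for two-stage programs can be sketched as a classification of expressions. The Python fragment below is a deliberately simplified illustration, not the analysis used by Fabius: the tuple encoding of expressions and the function `binding_time` are assumptions, and a real analysis must also handle function calls, typically by fixed-point iteration.

```python
EARLY, LATE = "early", "late"

def binding_time(expr, early_vars):
    """Classify an expression as EARLY or LATE.

    Hypothetical encoding:
      ("const", n)               a literal
      ("var", name)              a variable
      ("op", name, e1, e2, ...)  a primitive operation
      ("if", cond, then, else)   a conditional
    """
    kind = expr[0]
    if kind == "const":
        return EARLY                      # literals are always known early
    if kind == "var":
        return EARLY if expr[1] in early_vars else LATE
    # An operation or conditional is late if any subexpression is late.
    subs = expr[2:] if kind == "op" else expr[1:]
    times = [binding_time(e, early_vars) for e in subs]
    return LATE if LATE in times else EARLY
```

With only the first vector early, a test such as `v1 = nil` is classified as early, so it can be decided by the code generator, while a product involving an element of the second vector is late and must be emitted.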

All register allocation and assignment occurs at compile time; registers are assigned
independently to variables in early and late computations. In keeping with the focus on fast
run-time code generation, very few optimizations are applied at run time. The primary
optimizations are “constant” propagation, “constant” folding, dead-code elimination, and function
inlining. Loops are expressed as tail-recursive functions, so inlining effectively yields loop
unrolling. We have ignored the issue of instruction scheduling for the moment; we assume an
idealized RISC machine with no delay slots (see Appendix A). Run-time code generation occurs in a
single pass; no intermediate representation is constructed and no analysis or optimization is
performed on code after it is generated.

Our preliminary results are encouraging. As an example we consider vector-matrix multiplication,
which is often used to implement matrix-matrix multiplication and is common in scientific
computing applications. Berlin and Weise have investigated improvements to similar scientific code
through compile-time application of partial evaluation [BW90]. Vector-matrix multiplication is a
prime candidate for run-time code generation because the vector is fixed throughout the
computation, and the loop that computes the inner product of the vector with a row or column from
the matrix can be completely unrolled.⁶ Such optimizations cannot usually be performed at compile
time, however, because the sizes and contents of the vector and matrix are generally not statically
apparent. The source code for the example is given in Appendix B, along with one of the run-time
code generators produced by Fabius.

⁵Quintus Fabius Maximus was a Roman general best known for his defeat of Hannibal in the Second
Punic War. His primary strategy was to delay confrontation; repeated small attacks eventually led
to victory without a single

Figure 1: Time to multiply an n-vector with an n x n matrix

Figure  1  compares  the  total  execution  time  of  vector-matrix  multiplication  for  varying  input 
sizes  under  conventional  and  deferred  compilation.  The  inputs  were  vectors  of  length  n  and  square 
matrices  of  dimension  n  containing  pseudo-random  integers,  and  the  execution  times  are  given  in 
machine  cycles  (see  Appendix  A).  The  “conventionally  compiled”  code  was  produced  by  disabling 
the  Fabius  staging  analysis  and  is  of  high  quality.  The  dotted  line  represents  the  portion  of  time 
spent  generating  code  at  run  time;  this  time  is  included  in  the  total  execution  time  of  the  code 
produced  by  deferred  compilation.  As  the  figure  demonstrates,  deferred  compilation  can  yield 
significant  improvement  in  overall  execution  time  even  for  small  problem  sizes.  In  this  case  the 
cost of run-time code generation was recouped when multiplying a 16-element vector with a 16 x 16
matrix.  The  speedup  increases  linearly  with  larger  input  sizes  (ignoring  the  secondary  costs  detailed 
in  Section  5),  yielding  a  speedup  of  greater  than  20%  when  n  =  32. 

The amount of dynamically allocated data space was roughly the same under conventional
and deferred compilation. However, as expected, we observed a significant increase in code size.
The conventionally compiled code occupied just over 50 words; under deferred compilation the
size of the static code rose to nearly 275 words and the size of the run-time-generated code rose
linearly from 250 words to approximately 800 words as n ranged from 4 to 32. Increases of this
magnitude are to be expected when aggressively inlining, since we are trading space for time, but
it remains to be seen whether such increases are manageable in larger applications. More extensive
experimentation is currently underway.


⁶The arithmetic operations can also be optimized based on the contents of the vector, which will
likely yield substantial speedups for computations involving sparse data. The results presented
here do not reflect such improvements, since we have focused on fast run-time code generation at
the expense of some run-time optimizations.

5  Costs  of  Deferred  Compilation

The time required to optimize and generate code at run time has been our primary focus, but
the time-space tradeoff exploited by deferred compilation has some secondary costs that must be
considered in practice.

Code-space reclamation can be a significant cost in programs that pursue aggressive run-time
code generation. Conventional garbage collection techniques will likely suffice, although some new
strategies may prove profitable because dynamically allocated code objects differ from data objects
in both size and lifetime. Garbage collection might be complicated by the fact that
run-time-generated code may contain embedded pointers to other data and code objects; this can
occur if pointers are inlined like other values during optimization.⁷

Run-time code generation and modification can interact poorly with modern memory hierarchies
[KLS92]. Most modern architectures prefetch instructions into an instruction cache, and
many do not automatically invalidate cache entries when memory writes occur. Cache flushing
may therefore be required when dynamically generating or modifying code [Kep91]. The regularity
of code-space allocation and initialization may simplify amortizing the cost of such operations.
For example, the instruction cache could be flushed after code-space reclamation, and each newly
allocated code object could be aligned to a boundary that the instruction prefetch mechanism is
guaranteed not to have crossed while executing previously generated code, thus avoiding the
invalidation of cached instructions. An architecture with a write buffer or a write-back data cache
may require additional work to ensure that recently written instructions are fetched properly.

Another open question is how run-time code generation affects locality. Memory hierarchies
offer substantial rewards to programs with highly localized data and instruction access patterns.
Deferred compilation reduces locality by creating numerous optimized code blocks instead of
executing a more general code block multiple times. Run-time inlining can increase code size
significantly, thus decreasing locality. However, run-time dead-code elimination and other
optimizations can also reduce code size. Techniques adapted from partial evaluation [MogS*] may
also improve data locality by reorganizing data structures based on the staging of a program.

6  Deferred  Compilation  vs.  Partial  Evaluation

There are strong similarities between deferred compilation and offline partial evaluation [JGS93,
BD91], but some significant differences deserve mention. A partial evaluator can be viewed as a
generalized interpreter that, given a program and a portion of its input, produces a specialized
residual program that accepts the remaining input and produces the desired result.

The correctness of a partial evaluator, called mix for historical reasons [JSS89], is described
by the following equation, which specifies that the result produced by the residual program must
be the same as the result of the original program p when applied to the same inputs
($[\![p]\!]$ denotes evaluation of a program p, yielding a function):

\[ [\![\,[\![\mathit{mix}]\!](p, d_1)\,]\!]\; d_2 = [\![p]\!](d_1, d_2) \]

⁷These embedded pointers may be difficult to locate and update; for example, on the MIPS a
constant 32-bit pointer might be embedded into two instructions that contain 16-bit immediate
values. Instruction reordering during run-time code generation can make the locations of these
instructions unpredictable.

Perhaps the most intriguing aspect of partial evaluation is self-application. If mix is
implemented in the language that it interprets, it can specialize itself to a particular source
program p, yielding a program that generates a residual program when executed:

\[ [\![\,[\![\mathit{mix}]\!](\mathit{mix}, p)\,]\!]\; d_1 = [\![\mathit{mix}]\!](p, d_1) \]

This is the essence of our techniques for fast run-time code generation.
$[\![\mathit{mix}]\!](\mathit{mix}, p)$ is a generating extension that, when given the first
input value, will produce an optimized residual program. A generating extension is simply a
specialized code generator, and it can be constructed at compile time because it does not depend
on any run-time data. A further self-application of mix yields the stand-alone program, commonly
called cogen, that performs the construction:

\[ [\![\,[\![\mathit{mix}]\!](\mathit{mix}, \mathit{mix})\,]\!]\; p = [\![\mathit{mix}]\!](\mathit{mix}, p) \]
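
The shape of these equations can be checked with a toy model. In the Python sketch below, programs are identified with the functions they denote, so the evaluation brackets disappear, and `mix` is the trivial specializer that merely fixes the first argument without optimizing anything. Both simplifications, and the example program `dotprod`, are assumptions made for illustration; a real partial evaluator would also optimize p with respect to its first input.

```python
def mix(p, d1):
    """Trivial specializer: fix the first input of a two-input program."""
    return lambda d2: p(d1, d2)

def dotprod(v1, v2):
    """Example two-input program: inner product of two vectors."""
    return sum(a * b for a, b in zip(v1, v2))

# mix(p, d1): a residual program awaiting the remaining input.
residual = mix(dotprod, [1, 2, 3])

# mix(mix, p): a generating extension, which maps d1 to a residual program.
generating_extension = mix(mix, dotprod)

# mix(mix, mix): cogen, which maps a program to its generating extension.
cogen = mix(mix, mix)
```

All three routes agree: `residual([4, 5, 6])`, `generating_extension([1, 2, 3])([4, 5, 6])`, and `cogen(dotprod)([1, 2, 3])([4, 5, 6])` each compute the same inner product as `dotprod([1, 2, 3], [4, 5, 6])`.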

In practice this approach has not been used to implement automatic run-time code generation.
Typical partial evaluators [Bon93, CondS] are intended for source-to-source program transformation
(see Section 3.3) and produce residual programs in Scheme or a similar high-level language. The
generating extensions produced by self-application are therefore implemented in Scheme, and more
importantly, they generate Scheme code when executed. The use of such systems for run-time code
generation would therefore require general-purpose run-time compilation, which is too costly to be
widely applicable.

Implementing a self-applicable partial evaluator that directly generates machine code would
solve such problems.⁸ Generating extensions would be directly executable and they would generate
native code when executed. To the best of our knowledge, no such partial evaluator has been
described or implemented to date. One system that comes closer than most to this goal is AMIX,
a self-applicable partial evaluator for a first-order functional language whose target is an
abstract stack machine [Hol88]. AMIX’s abstract machine code is a relatively high-level language,
however, and the cost of compiling it to native code at run time would be substantial. The
interpretational overhead present in this compilation cannot be statically eliminated.

A promising alternative to self-application is the hand-implementation of cogen [BW93]. In fact
one can view Fabius as a hand-implemented cogen whose target language is RISC machine code.
This view is supported by our concentration on two-stage programs and our wholesale adoption
of numerous techniques originally developed for partial evaluators, such as binding-time analysis,
unfolding heuristics, and memoized specialization. However, the goals and strategies of Fabius,
such as one-pass native-code generation and static register allocation, differ from those of any
existing formulation of cogen.

7  Conclusions 

We have developed a new approach to compilation. It provides an alternative to compile-time
analysis and optimization by deferring aspects of optimization and code generation to run time.
Automatic staging analysis is employed to detect program stages in which run-time optimization
may be beneficial. Fast run-time optimization and code generation are achieved by eliminating the
overhead of processing intermediate representations of source programs at run time. Preliminary
experiments with a prototype compiler are promising, but we find that further experimentation is
required for a full assessment.


⁸Note that such a partial evaluator need not be implemented in its target language.

Appendix  A  Idealized  RISC


The details provided in this appendix may assist interpretation of the example in Section 2 and the
results presented in Section 4. Fabius currently generates code for an idealized RISC machine that
closely resembles the MIPS architecture. The primary difference is a lack of delay slots: memory
access, branch, and call instructions require only one cycle to complete. The idealized RISC also
supports a richer instruction set, including operations like move, push, and call; procedure linkage
uses the stack.

The emit pseudo-instruction is interpreted by our RISC simulator rather than being expanded
by the code generator, which facilitates the investigation of various peephole optimizations. The
timings described in Section 4 attribute a cost of four cycles and a size of four words to most
emit instructions. On the MIPS, two cycles would be required to load the 32-bit representation
of a fixed-operand instruction into a register. Two additional cycles are required to store the
instruction and update a code-segment pointer; the pointer update fills the delay slot of the store
instruction. The cost of updating the pointer could be amortized over several emits, so we can
reduce the average cost if another instruction is available to fill the delay slot. Fast allocation of
code space is a critical requirement. We assume a garbage-collected code segment with amortized
or hardware-supported overflow checking and cache flushing.


Appendix  B  Extended  Example 

This section details the vector-matrix multiplication example presented in Section 4. Although
Fabius does not yet support arrays, a realistic evaluation of the benefits of run-time code
generation can be made using other data structures, so we have implemented vectors as linked lists
and matrices as lists of vectors in row-major order:

    vm-mult(v, m, a) =
      if m = nil then reverse(a, nil)
      else let prod = dotprod(v, car m, nil)
           in vm-mult(v, cdr m, cons(prod, a))

    dotprod(v1, v2, a) =
      if v1 = nil then a
      else dotprod(cdr v1, cdr v2, a + car v1 * car v2)

The functions are implemented using tail recursion to reduce procedure-call overhead; the
accumulator can be viewed as an explicit encoding of call frames, the reversal of which corresponds
to a sequence of procedure returns. (The code for reverse has been omitted.) vm-mult computes the
dot product of the specified vector with each row of the given matrix and accumulates the results
in a list; dotprod simply sums the products of corresponding elements of two vectors.

Fabius creates memoized code generators for both vm-mult and dotprod; previously generated
code for a particular vector is reused rather than regenerated. An inlining code generator is also
created for dotprod; it is invoked by the memoized dotprod generator to generate code for its
recursive tail call (comments added):

    L9:   beq   r1, r0, L10          ; if v1 = nil goto L10
          ld    r2, (r1)             ; x1 = car(v1)
          ld    r1, 4(r1)            ; v1 = cdr(v1)
          emit  ld    r3, (r1)       ; emit "x2 = car(v2)"
          emit  ld    r1, 4(r1)      ; emit "v2 = cdr(v2)"
          emit  move  r4, [r2]       ; emit "tmp = [x1]"
          emit  mul   r3, r4, r3     ; emit "prod = tmp * x2"
          emit  add   r2, r2, r3     ; emit "a = a + prod"
          jmp   L9                   ; goto L9
    L10:  emit  move  r1, r2         ; emit "result = a"
          ret                        ; return

The first argument vector is supplied in r1. The run-time-generated code expects the second
argument vector in r1 and the accumulator in r2. If the first argument vector is [1,2,3], the
following code is generated at run time:

    ld    r3, (r1)       ; x2 = car(v2)
    ld    r1, 4(r1)      ; v2 = cdr(v2)
    move  r4, 1          ; tmp = 1
    mul   r3, r4, r3     ; prod = tmp * x2
    add   r2, r2, r3     ; a = a + prod
    ld    r3, (r1)
    ld    r1, 4(r1)      ; etc.
    move  r4, 2
    mul   r3, r4, r3
    add   r2, r2, r3
    ld    r3, (r1)
    ld    r1, 4(r1)
    move  r4, 3
    mul   r3, r4, r3
    add   r2, r2, r3
    move  r1, r2         ; result = a
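
The behavior of the dotprod generator can be mimicked in a few lines of Python. The sketch below is an illustration under stated assumptions, not the Fabius implementation: `gen_dotprod` produces instruction strings instead of machine words, and `run` is a hypothetical interpreter for just the four opcodes that appear in the generated code, modelling each cons cell as a Python pair and starting the accumulator at zero.

```python
def gen_dotprod(v1):
    """Emit unrolled code for dotprod specialized to the early vector v1."""
    code = []
    for x1 in v1:                        # early loop: runs in the generator
        code.append("ld r3, (r1)")       # x2 = car(v2)
        code.append("ld r1, 4(r1)")      # v2 = cdr(v2)
        code.append(f"move r4, {x1}")    # tmp = x1, inlined as an immediate
        code.append("mul r3, r4, r3")    # prod = tmp * x2
        code.append("add r2, r2, r3")    # a = a + prod
    code.append("move r1, r2")           # result = a
    return code

def run(code, v2, acc=0):
    """Interpret generated code on the late vector v2 (toy machine model)."""
    cell = None
    for x in reversed(v2):
        cell = (x, cell)                 # build v2 as a linked list of pairs
    reg = {"r1": cell, "r2": acc}
    for instr in code:
        op, rest = instr.split(" ", 1)
        args = [s.strip() for s in rest.split(",")]
        if op == "ld":                   # the base register is always r1 here
            dst, addr = args
            field = 1 if addr.startswith("4") else 0   # (r1) = car, 4(r1) = cdr
            reg[dst] = reg["r1"][field]
        elif op == "move":
            dst, src = args
            reg[dst] = reg[src] if src in reg else int(src)
        elif op == "mul":
            dst, a, b = args
            reg[dst] = reg[a] * reg[b]
        elif op == "add":
            dst, a, b = args
            reg[dst] = reg[a] + reg[b]
    return reg["r1"]
```

For a first vector of length n the generator emits 5n + 1 instructions, so the generated code grows linearly with n while containing no tests or branches.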


